Abstract: Polar codes asymptotically achieve the symmetric
capacity of memoryless channels, yet their error-correcting
performance under successive-cancellation (SC) decoding for short
and moderate length codes is worse than that of other modern
codes such as low-density parity-check (LDPC) codes. Of the
many methods to improve the error-correction performance
of polar codes, list decoding yields the best results, especially
when the polar code is concatenated with a cyclic redundancy
check (CRC). List decoding involves exploring several decoding
paths with SC decoding, and therefore tends to be slower
than SC decoding itself, by an order of magnitude in practical
implementations. In this paper, we present a new algorithm based
on unrolling the decoding tree of the code that improves the speed
of list decoding by an order of magnitude when implemented in
software. Furthermore, we show that for software-defined radio
applications, our proposed algorithm is faster than the fastest
software implementations of LDPC decoders in the literature
while offering comparable error-correction performance at
similar or shorter code lengths.

Index Terms: polar codes, list decoding, software decoders,
software-defined radio, LDPC.


I. Introduction 

Polar codes, proposed by Arıkan [1], achieve the symmetric
capacity of memoryless channels as the code length $N \to \infty$
using the low-complexity successive-cancellation (SC) decoding
algorithm. Their error-correction performance, however,
is mediocre for codes of short and moderate lengths (a few
thousand bits) and is worse than that of other modern codes,
such as low-density parity-check (LDPC) codes. To improve
their performance, polar codes are concatenated with a cyclic
redundancy check (CRC) as an outer code and decoded using
the list decoding algorithm (“list-CRC”). The resulting
error-correction performance can exceed that of LDPC codes of
similar length [?].

However, list-CRC decoding comes with a downside: the
sequential “bit-by-bit” decoding order of the SC algorithm
limits the speed of practical implementations, which further
decreases with increasing list size L. The complexity of
SC decoding is O(N log N), whereas a list decoder has a
higher complexity of O(LN log N). As a result, practical
hardware and software implementations of list decoders have
a throughput that is an order of magnitude lower than that of the
fastest SC decoder hardware [?], which achieves an information
throughput of 1.0 Gbps at 100 MHz in FPGA. The fastest
belief propagation polar decoder is also faster: it achieves
2.34 Gbps at 300 MHz in 65 nm CMOS [?]. On the other
hand, reported hardware list decoder implementations achieve
coded throughputs of 285 Mbps at 714 MHz for N = 1024
and L = 2 [?], and 335 Mbps at 847 MHz for N = 1024 and
L = 2 [?]. For a list size L = 16, the fastest decoder has a
coded throughput of 220 Mbps at a clock frequency of 641
MHz [?].

The key to increasing the speed of SC decoders is to break
the serial constraint imposed by successive cancellation. In [8],
it was recognized that certain decoding steps in SC decoding
were redundant for certain groups of bits that could instead be
estimated simultaneously, given appropriate implementations.
In that approach, called simplified successive cancellation
(SSC), groups of frozen bits do not need to be explicitly
decoded, since their values are already known (usually zero), and
groups of information bits can be estimated by thresholding
instead of serial successive cancellation. When viewing the
polar code in a tree representation, it is easy to see that the
code is a concatenation of smaller constituent codes. Groups of
frozen bits can be viewed as comprising a “Rate-0” code and
information bit groups as a “Rate-1” code. Later work further
increased the speed of SC decoding by decoding some
of the other “Rate-R” codes in the tree in parallel [3], [?]. The
Fast-SSC algorithm in [3] considers a variety of different constituent
codes, such as single-parity-check (SPC) and repetition codes,
decoding them with parallel hardware and estimating several bits
per clock cycle. The first portion of this work describes how
the Fast-SSC decoding algorithm was adapted for use in the
context of list decoding.

The second part describes how this algorithm performs
when implemented on a general purpose processor using
single-instruction multiple-data (SIMD) instructions. Such
systems were shown to have fast software SC decoders: the
decoder in [?] employs inter-frame parallelism, decoding
many frames in parallel, to achieve an information throughput
of 2.2 Gbps and a latency of 26 µs. Alternatively, intra-frame
parallelism targeting low-latency implementations was used
by [11] to reach an information throughput of up to 1.3 Gbps
with 1 µs latency. In addition, encoding of polar codes is a
low-complexity, O(N log N), operation that is well suited for
software implementation as it does not require permutation of
data [12].

The low encoding complexity combined with the good
error-correction performance of list-CRC decoding will
significantly improve the communication ability of wireless sensor
networks (WSN) using software-defined radio (SDR). The
sensor nodes benefit from the ability to use shorter codes,
reducing transmission time and energy, as well as the ability to
reduce transmission power. Alternatively, instead of reducing
transmission power, one can increase the distance between the
nodes and base stations, reducing the number of base stations
in the process. The nodes also benefit from the very low
complexity of polar encoders [?], [13]. The base stations,
which generally have less stringent energy requirements, can
use general purpose processors, including SIMD-capable
embedded ARM processors, to implement the proposed list-CRC
decoding algorithm and to process data on site. This reduces
the cost and development time of the WSN and increases
its flexibility as a result of the SDR components. This work
could also be used in other SDR applications that do not have
the scale to justify a custom hardware implementation but
where a throughput in the tens of Mbps is desirable. Quantum
key distribution is one such example, where a general purpose
processor or a graphics processing unit is used to perform
error correction [?]. Multiple SDR systems either include
a general purpose Intel processor or must be connected to
a computer [?], providing target platforms where our
proposed algorithm can be used.

This work expands and improves on previous conference
publications [?] and [?]. The algorithm in this paper has
been reformulated in terms of log-likelihood ratios (LLRs),
which yields speed improvements over the preliminary work
in [?]. Furthermore, the conference paper implemented a
list decoding algorithm based on SSC decoding (list-SSC),
while this work develops the Fast-SSC algorithm for list
decoding (list-Fast-SSC) and implements it, yielding further
performance improvements. In addition, a general path metric
is derived from codeword likelihoods, which is then used as
the basis for calculating all the specialized decoders’ output
metrics. Finally, unrolling [19] is applied to list decoders in
this work. The results show that our improved list decoding
algorithm results in a speedup of 11.9 times compared to
LLR-based list-SC decoding. In addition to the decoder in
[18], those of [20] and [21] also perform multi-bit decisions
and decoding. The main difference between them and the
proposed decoder is that the former perform multi-bit decisions
for any constituent code of length M bits using an
exhaustive-search decoding algorithm, whereas the proposed decoder
uses specialized, low-complexity algorithms to decode any
constituent code to which these algorithms apply, regardless
of the constituent code length. A version of [?] limited to
2-bit constituent codes appears in [22].

It should be noted that multi-bit decoding for Rate-1,
Rate-0, and repetition constituent codes was proposed in the context
of likelihood-based Reed-Muller (RM) decoders [23] and
likelihood-based RM list decoders [24]. This work differs from
[24] in that it targets LLR-based list decoders in the context of
polar codes and recognizes more special constituent multi-bit 
decoders. The algorithms introduced in this work focus on low 
implementation complexity, especially for SIMD processors 
and parallel hardware. 

This work starts by reviewing the construction of polar
codes and the list-CRC and the Fast-SSC decoding algorithms
in Section II. We then describe how to generate a software
polar decoder amenable to vectorization in Section II-D.
Section III introduces the proposed list decoding algorithm
and a software implementation is described in Section IV.
The speed and error-correction performance of the proposed
decoder are studied in Section VI and compared to those of
LDPC codes of the 802.3an [25] and 802.11n [26] standards
in Section VII. In the second comparison, we show that polar
codes can match or exceed the speed and error-correction
performance of software LDPC decoders while using shorter
codes.


II. Background 


A. Polar Codes 


A polar code of length N is constructed recursively from
two polar codes of length N/2. Successive-cancellation (SC)
decoding provides a bit estimate $\hat{u}_i$ using the channel output
$y_0^{N-1}$ and the previously estimated bits $\hat{u}_0^{i-1}$ according to

\[
\hat{u}_i =
\begin{cases}
0 & \text{when } \Pr[y, \hat{u}_0^{i-1} \mid u_i = 0] \geq \Pr[y, \hat{u}_0^{i-1} \mid u_i = 1]; \\
1 & \text{otherwise.}
\end{cases}
\tag{1}
\]

As $N \to \infty$, the probability of correctly estimating a bit
approaches 1 or 0.5. This is the channel polarization phenomenon
that is exploited by polar codes, which use reliable bit locations
to store information bits and set the unreliable bits, called frozen
bits, to zero. As a result, when the SC decoder is estimating a
bit $\hat{u}_i$, it is zero if the bit is frozen, or is calculated according
to (1).

Fig. 1a shows the graph of an (8, 4) polar code where frozen
bits are labeled in gray and information bits in black. The SC
decoder can also be viewed as a tree that is traversed depth
first. Such a tree is illustrated in Fig. 1b where each sub-tree
corresponds to a constituent code. The white nodes correspond
to frozen bits, and the black ones to information bits. The gray
nodes represent the concatenation operations combining two
constituent codes.

Two types of messages are passed along the edges of the tree
in the decoder: soft reliability values $\alpha$ (LLRs in this work)
and hard bit estimates $\beta$. When a node corresponding to
a constituent code of length $N_v$ receives the reliability values
$\alpha_v$ from its parent, represented using LLRs, the output to its left
child is calculated according to the F function:

\[
\begin{aligned}
\alpha_l[i] &= F(\alpha_v[i], \alpha_v[i + N_v/2]) \\
&= 2\,\mathrm{atanh}\bigl(\tanh(\alpha_v[i]/2)\,\tanh(\alpha_v[i + N_v/2]/2)\bigr) \\
&\approx \mathrm{sgn}(\alpha_v[i])\,\mathrm{sgn}(\alpha_v[i + N_v/2])\,\min(|\alpha_v[i]|, |\alpha_v[i + N_v/2]|),
\end{aligned}
\tag{2}
\]

where the approximation is the min-sum approximation.

Once the output of the left child $\beta_l$ is available, the message
to the right child is calculated using the G function

\[
\begin{aligned}
\alpha_r[i] &= G(\alpha_v[i], \alpha_v[i + N_v/2], \beta_l[i]) \\
&= \alpha_v[i + N_v/2] - (2\beta_l[i] - 1)\,\alpha_v[i].
\end{aligned}
\tag{3}
\]

Finally, when $\beta_r$ is known, the node’s output is computed as

\[
\beta_v[i] =
\begin{cases}
\beta_l[i] \oplus \beta_r[i] & \text{when } i < N_v/2; \\
\beta_r[i - N_v/2] & \text{otherwise,}
\end{cases}
\tag{4}
\]

where $\oplus$ is an XOR operation and we refer to the operation
as the Combine operation.
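
As a minimal scalar sketch of the three node operations (2) to (4), the following C++ functions may help make the message flow concrete. The names `f_op`, `g_op` and `combine_op`, and the use of `std::vector`, are assumptions made for illustration only; `f_op` uses the min-sum approximation, and none of this is the vectorized code discussed later in this work.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Min-sum F of (2): LLRs passed to the left child (output length N_v / 2).
std::vector<float> f_op(const std::vector<float>& alpha_v) {
    assert(alpha_v.size() % 2 == 0);
    const std::size_t half = alpha_v.size() / 2;
    std::vector<float> alpha_l(half);
    for (std::size_t i = 0; i < half; ++i) {
        const float a = alpha_v[i];
        const float b = alpha_v[i + half];
        const float mag = std::min(std::fabs(a), std::fabs(b));
        // Product of the two signs applied to the smaller magnitude.
        alpha_l[i] = (std::signbit(a) != std::signbit(b)) ? -mag : mag;
    }
    return alpha_l;
}

// G of (3): LLRs passed to the right child once beta_l (bits in {0, 1}) is known.
std::vector<float> g_op(const std::vector<float>& alpha_v,
                        const std::vector<int>& beta_l) {
    const std::size_t half = alpha_v.size() / 2;
    assert(alpha_v.size() % 2 == 0 && beta_l.size() == half);
    std::vector<float> alpha_r(half);
    for (std::size_t i = 0; i < half; ++i) {
        alpha_r[i] = alpha_v[i + half] - (2 * beta_l[i] - 1) * alpha_v[i];
    }
    return alpha_r;
}

// Combine of (4): merge the hard decisions of the two children.
std::vector<int> combine_op(const std::vector<int>& beta_l,
                            const std::vector<int>& beta_r) {
    assert(beta_l.size() == beta_r.size());
    const std::size_t half = beta_l.size();
    std::vector<int> beta_v(2 * half);
    for (std::size_t i = 0; i < half; ++i) {
        beta_v[i] = beta_l[i] ^ beta_r[i];  // first half: XOR of both children
        beta_v[i + half] = beta_r[i];       // second half: copy of the right child
    }
    return beta_v;
}
```

A node of length $N_v$ would call `f_op`, decode its left child, call `g_op` with the returned bits, decode its right child, and finish with `combine_op`.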
Fig. 1: The graph of an (8, 4) polar code and its corresponding
tree representation.


The output $\beta_v$ of a frozen node is always zero, and is
calculated using threshold detection for an information node:

\[
\beta_v = h(\alpha_v) =
\begin{cases}
0 & \text{when } \alpha_v \geq 0; \\
1 & \text{otherwise.}
\end{cases}
\tag{5}
\]


B. List-CRC Decoding 

When estimating an information bit, a list decoder continues
decoding along two paths: the first assumes that ‘0’ was the
correct bit estimate, and the second ‘1’. Therefore, at every
information bit, the decoder doubles the number of possible
outcomes up to a predetermined limit L. When the number
of paths exceeds L, the list is pruned by retaining only the
L most reliable paths. When decoding is over, the estimated
codeword with the largest reliability metric is selected as the
decoder output. It was observed in [2] that using a CRC as
the primary criterion for selecting the final decoder output
increased the error-correction performance significantly. In
addition, the CRC enables the use of an adaptive decoder where
the list size starts at two and is gradually increased until the
CRC is satisfied or a maximum list size is reached [27].

Initially, polar list decoders used likelihood [2] and
log-likelihood values [28] to represent reliabilities. Later,
log-likelihood ratios (LLRs) were used in [6] to reduce the
amount of memory used by a factor of two and to reduce
the processing complexity. In addition to the messages and
operations presented in Section II-A, the algorithm of [6]
stores a reliability metric $PM_l^i$ for each path $l$ that is updated
for every estimated bit $i$ according to

\[
PM_l^i =
\begin{cases}
PM_l^{i-1} & \text{if } \hat{u}_i = h(\alpha_v), \\
PM_l^{i-1} - |\alpha_v| & \text{otherwise.}
\end{cases}
\tag{6}
\]


It is important to note that the path metric is updated when 
encountering both information and frozen bits. 
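
The per-bit update can be sketched in a few lines of C++: a path keeps its metric when the estimated bit agrees with the hard decision on the LLR and is penalized by the LLR magnitude otherwise. The names `hard_decision`, `update_path_metric`, `Fork` and `fork_information_bit` are hypothetical, and the convention that a non-negative LLR maps to bit 0 is an assumption of this sketch.

```cpp
#include <cmath>

// Hard decision h: 0 for a non-negative LLR, 1 otherwise (assumed convention).
inline int hard_decision(float alpha) { return alpha < 0.0f ? 1 : 0; }

// Path-metric update for one estimated bit u_hat with LLR alpha, as in (6).
inline float update_path_metric(float pm, float alpha, int u_hat) {
    return (u_hat == hard_decision(alpha)) ? pm : pm - std::fabs(alpha);
}

// Metrics of the two paths created when an information bit is estimated.
struct Fork {
    float pm_zero;  // path that assumes the bit is 0
    float pm_one;   // path that assumes the bit is 1
};

inline Fork fork_information_bit(float pm, float alpha) {
    return {update_path_metric(pm, alpha, 0), update_path_metric(pm, alpha, 1)};
}
```

For a frozen bit, only `update_path_metric(pm, alpha, 0)` applies, which is why the metric can change even though no new path is created.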


C. Fast-SSC Decoding 

The SC decoder traverses the code tree until reaching leaf
nodes corresponding to codes of length one before estimating
a bit. This was found to be superfluous as the output of
sub-trees corresponding to constituent codes of rate 0 or rate 1 of
any length can be estimated without traversing their sub-trees
[8]. The output of a rate-0 node is known a priori to be an
all-zero vector of length $N_v$, while that of a rate-1 node can be found
by applying threshold detection element-wise on $\alpha_v$ so that

\[
\beta_v[i] = h(\alpha_v[i]) =
\begin{cases}
0 & \text{when } \alpha_v[i] \geq 0; \\
1 & \text{otherwise.}
\end{cases}
\tag{7}
\]

Fig. 2: Fast-SSC decoder graph for an (8,4) polar code.


The Fast-SSC algorithm utilizes low-complexity maximum-likelihood
(ML) decoding algorithms to decode constituent
repetition and single-parity-check (SPC) codes instead of
traversing their corresponding sub-trees [?].

The ML decision for a repetition code is

\[
\beta_v[i] =
\begin{cases}
0 & \text{when } \sum_j \alpha_v[j] \geq 0; \\
1 & \text{otherwise.}
\end{cases}
\tag{8}
\]

The SPC decoder performs threshold detection (7) to
calculate the intermediate hard-decision vector HD. The parity of
HD is calculated using modulo-2 addition and the least reliable
bit is found according to

\[
j = \arg\min_j |\alpha_v[j]|.
\]

The final output of the SPC decoder is

\[
\beta_v[i] =
\begin{cases}
\mathrm{HD}[i] \oplus \mathrm{parity} & \text{when } i = j; \\
\mathrm{HD}[i] & \text{otherwise.}
\end{cases}
\]

Fig. 2 shows a Fast-SSC decoder tree for the (8, 4) code,
indicating the messages passed in the decoder and the operations
used to calculate them.
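
The two single-output ML rules described in this section can be sketched as follows. This is an illustrative scalar version, not the Fast-SSC implementation itself; the function names `decode_repetition` and `decode_spc` are assumptions, as is the convention that a non-negative LLR maps to bit 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// ML decision for a repetition code: the sign of the LLR sum decides every bit.
std::vector<int> decode_repetition(const std::vector<float>& alpha_v) {
    float sum = 0.0f;
    for (float a : alpha_v) sum += a;
    return std::vector<int>(alpha_v.size(), sum >= 0.0f ? 0 : 1);
}

// ML decision for an SPC code: threshold detection, then flip the least
// reliable bit when the parity of the hard decisions is odd.
std::vector<int> decode_spc(const std::vector<float>& alpha_v) {
    assert(!alpha_v.empty());
    std::vector<int> beta_v(alpha_v.size());
    int parity = 0;
    std::size_t least = 0;  // index of the least reliable bit
    for (std::size_t i = 0; i < alpha_v.size(); ++i) {
        beta_v[i] = alpha_v[i] < 0.0f ? 1 : 0;  // intermediate value HD[i]
        parity ^= beta_v[i];
        if (std::fabs(alpha_v[i]) < std::fabs(alpha_v[least])) least = i;
    }
    beta_v[least] ^= parity;  // no change when the parity is already even
    return beta_v;
}
```

Both functions return a single codeword, which is the limitation that a list decoder has to lift by producing several candidates per node.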

The Fast-SSC decoder and its software implementation [?]
utilize additional specialized constituent decoders that are not
used in this work as they did not improve decoding speed.
In addition, the operations mentioned in this section and
implemented in [?] present a single output and therefore
cannot be applied directly to list decoding. In this work, we
show how they are adapted to present multiple candidates
and used in a list decoder.


D. Unrolling Software Decoders 

The software list decoder in [?] is run-time configurable,
i.e. the same executable is capable of decoding any polar
code without recompilation. While flexible, this limits the
achievable decoding speed. In [?], it was shown that
generating a decoder for a specific polar code yielded a significant
speed improvement by replacing branches with straight-line
code and increasing the utilization of SIMD instructions. This
process is managed by a developed CAD tool that divides the
process into two parts: decoder tree optimization, and C++
code generation.

Listing 1 Loop-based (8, 4) Fast-SSC Decoder

    for (unsigned int i = 0; i < operation_count; ++i) {
        operation_processor = fetch_operation_processor(i);
        operation_processor.execute(alpha_v, &alpha_r, &alpha_l, beta_l, beta_r, &beta_v);
    }

Listing 2 Unrolled (8, 4) Fast-SSC Decoder

    alpha_1 = F<8>(alpha_c);
    beta_1  = Repetition<4>(alpha_1);
    alpha_2 = G<8>(alpha_c, beta_1);
    beta_2  = SPC<4>(alpha_2);
    beta_c  = Combine<8>(beta_1, beta_2);

For the list decoder in this paper we applied this optimization
tool using a subset of the nodes available to the complete
Fast-SSC algorithm: Rate-0 (frozen), Rate-1 (information),
repetition, and SPC nodes. The decoder tree optimizer
traverses the decoder tree starting from its root. If a sub-tree
rooted at the current node has a higher decoding latency than
an applicable Fast-SSC node, it is replaced with the latter. If
no Fast-SSC node can replace the current
tree, the optimizer moves to the current node’s children and
repeats the process.

Once the tree is optimized, the corresponding C++ code is
generated. All functions are passed the current $N_v$ value as a
template parameter, enabling vectorization and loop unrolling.

Listings 1 and 2 show a loop-based decoder and an unrolled
one for the (8, 4) code in Fig. 2, respectively. In the
loop-based decoder, both iterating over the decoding operations
and selecting an appropriate decoding function (called an
operation processor) to execute involve branches. In addition,
the operation processor does not know the size of the data
it is operating on at compile time; as such, it must have
another loop inside. The unrolled decoder can eliminate these
branches since both the decoder flow and data sizes are known
at compile time.

III. Proposed List-Decoding Algorithm 

When performing operations corresponding to a rate-R
node, a list decoder with a maximum list size L performs
the operations F (2), G (3), and Combine (4) on each of the
paths independently. It is only at the leaf nodes that interaction
between the paths occurs: the decoder generates new paths and
retains the most reliable L ones.

A significant difference between the baseline SC-list
decoder and the proposed algorithm is that each path in the
former generates two candidates, whereas in the latter, the
leaf nodes with sizes larger than one can generate multiple
candidates for each path.

All path-generating nodes store the candidate path reliability
metrics in a priority queue so that the worst candidate can be
quickly found and replaced with a new path when appropriate.
This is an improvement over [?], where path reliability
metrics were kept sorted at all times by using a red-black
(RB) tree. The most common operation in candidate selection
is locating the path with the minimum reliability, which is
an O(log L) operation in RB trees; the order of the remaining
candidates is irrelevant. A heap-backed priority queue provides
O(1) minimum-value lookup and O(log L) insertion and
removal, and is therefore more efficient than an RB tree for
the intended application.

Algorithm 3 Candidate selection process

    for s ∈ sourcePaths do
        Generate candidates.
        Store reliability of all candidates except the ML one.
        Store ML decision.
    end for
    for p ∈ candidates do
        if fewer than L candidates are stored then
            Store p.
        else if $PM_p^t$ > min. stored candidate reliability then
            Replace min. reliability candidate with p.
        end if
    end for

In this section, we describe how each node generates 
its output paths and calculates the corresponding reliability 
metrics. The process of retaining the L most reliable paths is 
described in Algorithm 3. Performing the candidate selection
in two passes and storing the ML decisions first are necessary 
to prevent candidates generated by the first few paths from 
overwriting the input for later ones. 
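
The second pass of the selection process, which keeps the L most reliable candidates by replacing the least reliable stored one whenever a better candidate arrives, can be sketched with a min-heap from the C++ standard library. The `Candidate` structure, its `flip_pattern` field and the function name `select_candidates` are hypothetical and only serve the illustration; the decoder described in this work may organize its data differently.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

struct Candidate {
    float pm;            // path reliability metric (larger is more reliable)
    std::size_t source;  // index of the source path
    int flip_pattern;    // hypothetical tag: bits flipped relative to the ML decision
    bool operator>(const Candidate& other) const { return pm > other.pm; }
};

// Retain the L most reliable candidates. The min-heap gives O(1) access to the
// least reliable stored candidate and O(log L) insertion and removal.
std::vector<Candidate> select_candidates(const std::vector<Candidate>& candidates,
                                         std::size_t L) {
    std::priority_queue<Candidate, std::vector<Candidate>, std::greater<Candidate>> heap;
    for (const Candidate& c : candidates) {
        if (heap.size() < L) {
            heap.push(c);
        } else if (L > 0 && c.pm > heap.top().pm) {
            heap.pop();  // drop the least reliable stored candidate
            heap.push(c);
        }
    }
    std::vector<Candidate> kept;
    while (!heap.empty()) {  // drained in order of increasing reliability
        kept.push_back(heap.top());
        heap.pop();
    }
    return kept;
}
```

Only the minimum is ever inspected during selection, which is why a heap suffices and a fully sorted container is not needed.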

A. Candidate Generation and Reliability 

The aim of the proposed algorithm is to directly generate
candidates without traversing sub-trees whenever possible. To
achieve this goal, we use the candidate-enumeration method
of Chase decoding [29] to provide a list of candidate paths at
the output of a rate-1 decoder.

The log-likelihood of a candidate codeword $\beta_j$ is

\[
\begin{aligned}
\ell(\beta_j) &= \sum_i (1 - 2\beta_j[i])\,\alpha_v[i] \\
&= \sum_i (1 - 2\beta_j[i])\,\mathrm{sgn}(\alpha_v[i])\,|\alpha_v[i]| \\
&= \sum_i \bigl(1 - 2\,|\beta_j[i] - h(\alpha_v[i])|\bigr)\,|\alpha_v[i]|.
\end{aligned}
\tag{10}
\]

The factor

\[
(1 - 2\beta_j[i])\,\mathrm{sgn}(\alpha_v[i]) = 1 - 2\,|\beta_j[i] - h(\alpha_v[i])| =
\begin{cases}
+1 & \text{when } \beta_j[i] = h(\alpha_v[i]), \\
-1 & \text{otherwise.}
\end{cases}
\]

The ML candidate codeword is

\[
\beta_{\mathrm{ML}} = \arg\max_{\beta_j \in C} \ell(\beta_j),
\]

where C is the set of all codewords. The other candidates are
generated by flipping bits relative to the ML decision, both
individually and simultaneously, subject to the constraint that
the candidate is a valid codeword.
To ensure that the codeword log-likelihood values remain $\leq 0$,
we offset $\ell(\beta_j)$ by $\sum_i |\alpha_v[i]|$. In addition, we scale the metric
by a factor of 0.5. The resulting codeword metric becomes

\[
\begin{aligned}
\ell'(\beta_j) &= \frac{\ell(\beta_j) - \sum_i |\alpha_v[i]|}{2} \\
&= \frac{\sum_i \bigl(1 - 2\,|\beta_j[i] - h(\alpha_v[i])|\bigr)\,|\alpha_v[i]| - \sum_i |\alpha_v[i]|}{2} \\
&= \frac{\sum_i \bigl(1 - 2\,|\beta_j[i] - h(\alpha_v[i])| - 1\bigr)\,|\alpha_v[i]|}{2} \\
&= -\sum_i |\beta_j[i] - h(\alpha_v[i])|\,|\alpha_v[i]|.
\end{aligned}
\tag{11}
\]

This metric states that a codeword is penalized for any
difference between it and the vector calculated from $\alpha_v$ using
(7).

When starting from a source path $s$ with reliability $PM_s^{t-1}$,
the reliability of the path corresponding to the codeword $\beta_j$ is

\[
PM_j^t = PM_s^{t-1} - \sum_i |\beta_j[i] - h(\alpha_v[i])|\,|\alpha_v[i]|.
\tag{12}
\]

All specialized decoders generate their candidates based on
this metric by restricting potential codewords.
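
As a quick numerical check of the derivation, the sketch below computes the path reliability in its penalty form (12) and, separately, the offset and scaled log-likelihood that it is derived from; the two should agree when the source metric is zero. The function names `candidate_metric` and `scaled_log_likelihood` are assumptions for illustration, as is the convention that a non-negative LLR maps to bit 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Penalty form of (12): subtract |alpha_v[i]| wherever the candidate bit
// differs from the hard decision on alpha_v[i].
float candidate_metric(float pm_source, const std::vector<float>& alpha_v,
                       const std::vector<int>& beta) {
    assert(alpha_v.size() == beta.size());
    float pm = pm_source;
    for (std::size_t i = 0; i < alpha_v.size(); ++i) {
        const int hd = alpha_v[i] < 0.0f ? 1 : 0;
        if (beta[i] != hd) pm -= std::fabs(alpha_v[i]);
    }
    return pm;
}

// Log-likelihood of the candidate, offset by the sum of LLR magnitudes and
// scaled by 0.5, i.e. the starting point of the derivation.
float scaled_log_likelihood(const std::vector<float>& alpha_v,
                            const std::vector<int>& beta) {
    assert(alpha_v.size() == beta.size());
    float likelihood = 0.0f;
    float offset = 0.0f;
    for (std::size_t i = 0; i < alpha_v.size(); ++i) {
        likelihood += (1 - 2 * beta[i]) * alpha_v[i];
        offset += std::fabs(alpha_v[i]);
    }
    return 0.5f * (likelihood - offset);
}
```

For example, with LLRs {1.5, -0.5, 2.0} and candidate {1, 1, 0}, only the first bit disagrees with the hard decision, so both functions give a metric of -1.5.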


B. Rate-0 Decoders 


Rate-0 nodes do not generate new paths; however, like their
length-1 counterparts in SC-list decoding [6], they alter path
reliability values. In [6], the path metric was updated according
to

\[
PM_l^i =
\begin{cases}
PM_l^{i-1} & \text{if } h(\alpha_v) = 0, \\
PM_l^{i-1} - |\alpha_v| & \text{otherwise.}
\end{cases}
\]

The all-zero codeword, $\beta_j[i] = 0\ \forall i$, is the only valid
codeword. Therefore, based on (12), the output path metric
is

\[
PM_s^t = PM_s^{t-1} - \sum_i h(\alpha_v[i])\,|\alpha_v[i]|.
\tag{13}
\]

An alternative formulation for (13) is

\[
PM_s^t = PM_s^{t-1} - \sum_{i=0}^{N_v - 1} |\min(\alpha_v[i], 0)|.
\tag{14}
\]


C. Rate-1 Decoders 

A decoder for a length-$N_v$ rate-1 constituent code can
provide up to $2^{N_v}$ candidate codewords. This approach is
impractical as it scales exponentially in $N_v$. The Chase-II
decoding algorithm considers only a limited set of the
least-reliable bits to generate its candidates [29]. We use the same
method to limit the complexity of rate-1 decoders when
enumerating the candidates selected for consideration in (12).

The maximum-likelihood decoding rule for a rate-1 code is
(7). Additional candidates are generated by flipping the least
reliable bits both independently and simultaneously.
Empirically, we found that considering only the two least-reliable
bits, whose indexes are denoted $\min_1$ and $\min_2$, is sufficient
to match the performance of SC list decoding. Therefore, for
each source path $s$, the proposed rate-1 decoder generates four
candidates with the following reliability values

\[
\begin{aligned}
PM_0^t &= PM_s^{t-1}, \\
PM_1^t &= PM_s^{t-1} - |\alpha_v[\min_1]|, \\
PM_2^t &= PM_s^{t-1} - |\alpha_v[\min_2]|, \\
PM_3^t &= PM_s^{t-1} - |\alpha_v[\min_1]| - |\alpha_v[\min_2]|;
\end{aligned}
\]

where $PM_0^t$ corresponds to the ML decision, $PM_1^t$ to the ML
decision with the least-reliable bit flipped, $PM_2^t$ to the ML
decision with the second least-reliable bit flipped, and $PM_3^t$ to
the ML decision with the two least-reliable bits flipped.
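
A scalar sketch of this candidate generation is given below: it locates the two least reliable positions with a single pass and returns the four reliabilities in the order listed above. The structure `Rate1Candidates` and the function `rate1_candidates` are illustrative names, and the linear search stands in for the SIMD sorting network used in the actual implementation.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Rate1Candidates {
    std::size_t min1;         // index of the least reliable bit
    std::size_t min2;         // index of the second least reliable bit
    std::array<float, 4> pm;  // reliabilities of the four candidates
};

Rate1Candidates rate1_candidates(float pm_source, const std::vector<float>& alpha_v) {
    assert(alpha_v.size() >= 2);
    std::size_t m1 = 0;
    std::size_t m2 = 1;
    if (std::fabs(alpha_v[m2]) < std::fabs(alpha_v[m1])) std::swap(m1, m2);
    for (std::size_t i = 2; i < alpha_v.size(); ++i) {
        const float mag = std::fabs(alpha_v[i]);
        if (mag < std::fabs(alpha_v[m1])) {
            m2 = m1;
            m1 = i;
        } else if (mag < std::fabs(alpha_v[m2])) {
            m2 = i;
        }
    }
    const float a1 = std::fabs(alpha_v[m1]);
    const float a2 = std::fabs(alpha_v[m2]);
    Rate1Candidates out;
    out.min1 = m1;
    out.min2 = m2;
    // ML decision, flip min1, flip min2, flip both.
    out.pm = {pm_source, pm_source - a1, pm_source - a2, pm_source - a1 - a2};
    return out;
}
```

The corresponding candidate codewords would be obtained by applying threshold detection to `alpha_v` and flipping the bits at `min1` and `min2` as indicated.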


D. SPC Decoders 


Codewords of an SPC code must satisfy the even-parity
constraint, i.e. $\sum_i \beta_j[i] = 0$, where the summation is performed
using binary arithmetic. As such, $2^{N_v - 1}$ candidate codewords
are available, leading to impractical implementations with
exponential complexity. Similar to the rate-1 decoders, we
use the Chase-II candidate generation to limit the number
of candidates. Simulation results, presented in Section VI,
showed that flipping combinations of the four least-reliable
bits caused only a minor degradation in error-correction
performance for L < 16 and SPC code length > 4. The
error-correction performance change was negligible for smaller
L values. Increasing the number of least-reliable bits under
consideration decreased the decoder speed to the point where
not utilizing specialized decoders for SPC codes of length > 4
yielded a faster decoder.

We define q as an indicator function so that q = 1 when the
parity check is satisfied and 0 otherwise. Using this notation,
the reliabilities of the candidates, in an expanded form of (12),
are

\[
\begin{aligned}
PM_0^t &= PM_s^{t-1} - (1 - q)\,|\alpha_v[\min_1]|, \\
PM_1^t &= PM_0^t - q\,|\alpha_v[\min_1]| - |\alpha_v[\min_2]|, \\
PM_2^t &= PM_0^t - q\,|\alpha_v[\min_1]| - |\alpha_v[\min_3]|, \\
PM_3^t &= PM_0^t - q\,|\alpha_v[\min_1]| - |\alpha_v[\min_4]|, \\
PM_4^t &= PM_0^t - |\alpha_v[\min_2]| - |\alpha_v[\min_3]|, \\
PM_5^t &= PM_0^t - |\alpha_v[\min_2]| - |\alpha_v[\min_4]|, \\
PM_6^t &= PM_0^t - |\alpha_v[\min_3]| - |\alpha_v[\min_4]|, \\
PM_7^t &= PM_0^t - q\,|\alpha_v[\min_1]| - |\alpha_v[\min_2]| - |\alpha_v[\min_3]| - |\alpha_v[\min_4]|;
\end{aligned}
\]

where $PM_0^t$ is the reliability of the ML decision calculated
according to (?). The remaining reliability values correspond to
flipping an even number of bits compared to the ML decision
so that the single-parity-check constraint remains satisfied.
Applying this rule when the input already satisfies the SPC
constraints generates candidates where no bits are flipped, two
bits are flipped, and four bits are flipped. Otherwise, one and
three bits are flipped.
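
The eight reliabilities can be evaluated directly from the four smallest LLR magnitudes and the parity indicator q. The sketch below transcribes the expressions listed above as they are stated; the function name `spc_candidate_metrics` and the use of `std::partial_sort` in place of a SIMD sorting network are assumptions made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

std::array<float, 8> spc_candidate_metrics(float pm_source,
                                           const std::vector<float>& alpha_v) {
    assert(alpha_v.size() >= 4);
    // Indices of the four least reliable bits, in order of increasing magnitude.
    std::vector<std::size_t> idx(alpha_v.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + 4, idx.end(),
                      [&](std::size_t a, std::size_t b) {
                          return std::fabs(alpha_v[a]) < std::fabs(alpha_v[b]);
                      });
    // Parity of the hard decisions; q = 1 when the parity check is satisfied.
    int parity = 0;
    for (float a : alpha_v) parity ^= (a < 0.0f ? 1 : 0);
    const float q = (parity == 0) ? 1.0f : 0.0f;

    const float m1 = std::fabs(alpha_v[idx[0]]);
    const float m2 = std::fabs(alpha_v[idx[1]]);
    const float m3 = std::fabs(alpha_v[idx[2]]);
    const float m4 = std::fabs(alpha_v[idx[3]]);

    const float pm0 = pm_source - (1.0f - q) * m1;  // ML decision
    const std::array<float, 8> pm = {{
        pm0,
        pm0 - q * m1 - m2,
        pm0 - q * m1 - m3,
        pm0 - q * m1 - m4,
        pm0 - m2 - m3,
        pm0 - m2 - m4,
        pm0 - m3 - m4,
        pm0 - q * m1 - m2 - m3 - m4}};
    return pm;
}
```

When L = 2 only the first two entries are needed, so a decoder specialized for that list size could skip locating the third and fourth least reliable bits.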

When the list size L = 2, at most two candidates from any
given source path are retained. Therefore, only the two most
reliable candidates, corresponding to $PM_0^t$ and $PM_1^t$, need to be
evaluated for each source path, regardless of the length
of the SPC code. This is supported by the simulation results
shown in Section VI.


E. Repetition Decoders 

A repetition decoder has two possible outputs: the all-zero
and the all-one codewords. According to (12), their
reliabilities are

\[
\begin{aligned}
PM_0^t &= PM_s^{t-1} - \sum_i h(\alpha_v[i])\,|\alpha_v[i]|, \\
PM_1^t &= PM_s^{t-1} - \sum_i |1 - h(\alpha_v[i])|\,|\alpha_v[i]| \\
&= PM_s^{t-1} - \sum_i h(-\alpha_v[i])\,|\alpha_v[i]|,
\end{aligned}
\]

where $PM_0^t$ and $PM_1^t$ are the path reliability values corresponding
to the all-zero and all-one codewords, respectively. The
all-zero reliability is penalized for every input corresponding
to a ‘1’ estimate, i.e. negative LLR; and the all-one for every
input corresponding to a ‘0’ estimate. These two equations can
be rewritten as

\[
\begin{aligned}
PM_0^t &= PM_s^{t-1} - \sum_i |\min(\alpha_v[i], 0)|, \\
PM_1^t &= PM_s^{t-1} - \sum_i |\max(\alpha_v[i], 0)|.
\end{aligned}
\]

The ML decision is found according to $\arg\max_i (PM_i^t)$,
which is the same as performing (8).
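
The min/max formulation translates into a single accumulation loop, sketched below in scalar form. The names `RepetitionCandidates` and `repetition_candidates` are hypothetical, and ties between the two reliabilities are resolved in favour of the all-zero codeword as an assumption of this sketch.

```cpp
#include <algorithm>
#include <vector>

struct RepetitionCandidates {
    float pm_all_zero;  // reliability of the all-zero codeword
    float pm_all_one;   // reliability of the all-one codeword
    int ml_bit;         // bit value of the more reliable codeword
};

RepetitionCandidates repetition_candidates(float pm_source,
                                           const std::vector<float>& alpha_v) {
    float negative_sum = 0.0f;  // sum of min(alpha, 0), always <= 0
    float positive_sum = 0.0f;  // sum of max(alpha, 0), always >= 0
    for (float a : alpha_v) {
        negative_sum += std::min(a, 0.0f);
        positive_sum += std::max(a, 0.0f);
    }
    RepetitionCandidates out;
    out.pm_all_zero = pm_source + negative_sum;  // penalized by negative LLRs
    out.pm_all_one = pm_source - positive_sum;   // penalized by positive LLRs
    out.ml_bit = (out.pm_all_zero >= out.pm_all_one) ? 0 : 1;
    return out;
}
```

Since the difference between the two reliabilities equals the sum of the LLRs, `ml_bit` matches the sign test on the LLR sum.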


IV. Implementation 


In this section we describe the methods used to implement 
our proposed algorithm on an x86 CPU supporting SIMD 
instructions. We created two versions: one for CPUs that 
support the AVX instructions, and the other using SSE for 
CPUs that do not. For brevity, we only discuss the AVX 
implementation when both implementations are similar. In 
cases where they differ significantly, both implementations are 
presented. 

We use 32-bit floating-point (float) to represent the binary-valued
$\beta$, in addition to the real-valued $\alpha$, since it improves
vectorization of the g operation as explained in Section IV-C.


A. Memory Layout for $\alpha$ Values

Memory is organized into stages: the input to all constituent
codes of length $N_v$ is stored in stage $S_{\log_2 N_v}$. Due to the
sequential nature of the decoding process, only $N_v$ values need
to be stored for a stage since old values are discarded when
new ones are available. For example, the input to the SPC node
of size 4 in Fig. 2 will be stored in $S_2$, overwriting the input
to the repetition node of the same size.

When using SIMD instructions, memory must be aligned
according to the SIMD vector size: 16-byte and 32-byte
boundaries for SSE and AVX, respectively. In addition, each stage
is padded to ensure that its size is at least that of the SIMD
vector. Therefore, a stage of size $N_v$ is allocated $\max(N_v, V)$
elements, where V is the number of $\alpha$ values in a SIMD vector,
and the total memory allocated for storing $\alpha$ values is

\[
N + L \sum_{i=0}^{\log_2 N - 1} \max(2^i, V)
\]

LLR (float) elements, where the values in stage $S_{\log_2 N}$ are
the channel reliability information that is shared among all
paths and L is the list size.
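
To give a feel for the numbers, the total above can be evaluated with a short helper. The function name `alpha_memory_elements` is an assumption for illustration; V = 8 corresponds to eight 32-bit floats in a 256-bit AVX vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of float elements needed for the LLR values:
// N shared channel values plus, for each of the L paths, one padded stage
// per constituent code length 1, 2, ..., N / 2.
std::size_t alpha_memory_elements(std::size_t N, std::size_t L, std::size_t V) {
    assert(N >= 1 && (N & (N - 1)) == 0);  // N must be a power of two
    std::size_t per_path = 0;
    for (std::size_t stage_size = 1; stage_size < N; stage_size *= 2) {
        per_path += std::max(stage_size, V);  // padding to the SIMD vector size
    }
    return N + L * per_path;
}
```

For N = 1024 and V = 8, each path needs 1040 elements for its private stages, so a single path uses 2064 floats in total and each additional path adds another 1040.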

During the candidate forking process at a stage $S_i$, a path $p$
is created from a source path $s$. The new path $p$ shares all the
information with $s$ for stages $\in [S_{\log_2 N}, S_i)$. This is exploited
in order to minimize the number of memory copy operations
by updating memory pointers when a new path is created [2].
For stages $\in [S_0, S_i]$, path $p$ gets its own memory since the
values stored in these stages will differ from those calculated
by other descendants of $s$.

B. Memory Layout for $\beta$ Values

Memory for $\beta$ values is also arranged into stages. However,
since calculating $\beta_v$ (4) requires both $\beta_l$ and $\beta_r$, values from
left and right children are stored separately and do not
overwrite each other. Once alignment and padding are accounted
for, the total memory required to store $\beta$ values is

\[
L \left( N + 2 \sum_{i=0}^{\log_2 N - 1} \max(2^i, V) \right).
\]

As stage $S_{\log_2 N}$ stores the output candidate codewords of the
decoder, which will not be combined with other values, only
L, instead of 2L, memory blocks are required.

Stored $\beta$ information is also shared by means of memory
pointers. Candidates generated at a stage $S_i$ share all
information for stages $\in [S_0, S_i)$.

C. Rate-R and Rate-0 Nodes 

Exploiting the sign-magnitude floating-point representation
defined in IEEE-754 allows for an efficient vectorized
implementation of the f operation (2). Extracting the sign and
calculating the absolute values in (2) become simple bit-wise
AND operations with the appropriate mask.

The g operation can be written as

\[
g(\alpha_v[i], \alpha_v[i + N_v/2], \beta_l[i]) = \alpha_v[i + N_v/2] + (1 - 2\beta_l[i]) * \alpha_v[i].
\]

If we use $\beta \in \{+0.0, -0.0\}$ instead of $\{0, 1\}$, the multiplication
by $(1 - 2\beta_l[i])$ reduces to a sign flip and the g operation
(3) can be implemented as

\[
\alpha_v[i + N_v/2] + \beta_l[i] \oplus \alpha_v[i]. \tag{15}
\]

Replacing the multiplication (*) with an XOR ($\oplus$) operation
in (15) is possible due to the sign-magnitude representation of
IEEE-754.
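
The sign-flip trick can be checked on scalars before looking at the vectorized code. In the sketch below, `xor_float`, `g_xor` and `g_ref` are hypothetical helper names; `std::memcpy` is used to access the bit pattern of a float without undefined behaviour, and the platform is assumed to use 32-bit IEEE-754 floats.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "32-bit float assumed");

// XOR of the raw bit patterns of two floats.
inline float xor_float(float a, float b) {
    std::uint32_t ua;
    std::uint32_t ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    ua ^= ub;
    float out;
    std::memcpy(&out, &ua, sizeof out);
    return out;
}

// g with beta_l stored as +0.0f (bit 0) or -0.0f (bit 1): -0.0f has only the
// sign bit set, so the XOR flips the sign of alpha_left and nothing else.
inline float g_xor(float alpha_left, float alpha_right, float beta_l) {
    return alpha_right + xor_float(beta_l, alpha_left);
}

// Reference g with beta_l in {0, 1}, written as in (3).
inline float g_ref(float alpha_left, float alpha_right, int beta_l) {
    return alpha_right - (2 * beta_l - 1) * alpha_left;
}
```

The same idea carries over to SIMD registers, where a single XOR instruction applies the sign flips of a whole vector of bit estimates at once.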

Listing 4 shows the corresponding AVX implementations,
originally presented in [11], [19], of the f and g functions
using the SIMD intrinsic functions provided by GCC. For
clarity of exposition, m256 is used instead of __m256 and
the _mm256_ prefix is removed from the intrinsic function
names.
Listing 4 Vectorized f and g functions

    // g: alpha_out = alpha_r + (beta_l XOR alpha_l), with beta in {+0.0, -0.0}
    template<unsigned int N_v>
    void G(const float* alpha_in, float* alpha_out, const float* beta_in) {
        for (unsigned int i = 0; i < N_v / 2; i += 8) {
            m256 alpha_l = load_ps(alpha_in + i);
            m256 alpha_r = load_ps(alpha_in + i + N_v / 2);
            m256 beta_l = load_ps(beta_in + i);
            // Flip the sign of alpha_l wherever beta_l is -0.0.
            m256 alpha_l_signed = xor_ps(beta_l, alpha_l);
            m256 alpha_o = add_ps(alpha_r, alpha_l_signed);
            store_ps(alpha_out + i, alpha_o);
        }
    }

    // f: min-sum approximation of (2)
    template<unsigned int N_v>
    void F(const float* alpha_in, float* alpha_out) {
        for (unsigned int i = 0; i < N_v / 2; i += 8) {
            m256 alpha_l = load_ps(alpha_in + i);
            m256 alpha_r = load_ps(alpha_in + i + N_v / 2);
            m256 sign = and_ps(xor_ps(alpha_l, alpha_r), SIGN_MASK);
            // andnot_ps(a, b) computes (~a) & b, so the mask is the first operand.
            m256 abs_l = andnot_ps(SIGN_MASK, alpha_l);
            m256 abs_r = andnot_ps(SIGN_MASK, alpha_r);
            m256 alpha_o = or_ps(sign, min_ps(abs_l, abs_r));
            store_ps(alpha_out + i, alpha_o);
        }
    }


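Listing 5 Path reliability update in Rate-0 decoders.

    m256 ZERO = set1_ps(0.0);
    m256 PMv = ZERO;
    // Loop over all N_v inputs of the node, matching the sum in (14).
    for (unsigned int i = 0; i < N_v; i += 8) {
        // Accumulate min(alpha, 0): negative LLRs penalize the all-zero output.
        PMv = add_ps(PMv, min_ps(load_ps(alpha_in + i), ZERO));
    }
    PM = sum_i PMv[i];  // horizontal sum over the SIMD lanes (pseudocode)

Rate-0 decoders set their output to the all-zero vector using
store instructions. The path reliability (PM) calculation (14) is
implemented as in Listing 5.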

D. Rate-1 Nodes 

Since $\beta \in \{+0.0, -0.0\}$ and $\alpha$ values are
represented using sign-magnitude notation, the threshold
detection is performed using a bit mask (SIGN_MASK).

Sorting networks can be implemented using SIMD instructions
to efficiently sort data on a CPU [30]. For rate-1 nodes of
length 4, a partial sorting network (PSN), implemented using
SSE instructions, is used to find the two least reliable bits. For
longer constituent codes, the reliability values are reduced to
two SIMD vectors: the first, $v_0$, containing the least reliable
bits and the second, $v_1$, containing the least reliable bits not
included in $v_0$. When these two vectors are partially sorted
using the PSN, $\min_2$ will be either the second least-reliable
bit in $v_0$ or the least-reliable bit in $v_1$.
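The reduction to two vectors can be illustrated with a scalar emulation. This is a sketch of one possible reading of the text, not the paper's SSE code: a four-lane vector is assumed, `v0[j]` holds the smallest magnitude seen in lane `j`, `v1[j]` holds the smallest magnitude in lane `j` that is not in `v0[j]`, and `std::partial_sort` stands in for the partial sorting network. The function name `two_least_reliable` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumed SIMD width for this sketch

// Returns {smallest, second smallest} magnitude among the input values.
std::pair<float, float> two_least_reliable(const std::vector<float>& alpha) {
    assert(alpha.size() >= 2);
    const float inf = std::numeric_limits<float>::infinity();
    std::array<float, kLanes> v0, v1;
    v0.fill(inf);
    v1.fill(inf);
    // Lane-wise reduction of the reliabilities into v0 and v1.
    for (std::size_t i = 0; i < alpha.size(); ++i) {
        const std::size_t j = i % kLanes;
        const float m = std::fabs(alpha[i]);
        v1[j] = std::min(v1[j], std::max(v0[j], m));  // runner-up of lane j
        v0[j] = std::min(v0[j], m);                   // minimum of lane j
    }
    // Only the two smallest entries of v0 are needed, hence a partial sort.
    std::partial_sort(v0.begin(), v0.begin() + 2, v0.end());
    const float min1 = v0[0];
    // The second minimum is either the runner-up in v0 or the minimum of v1.
    const float min2 = std::min(v0[1], *std::min_element(v1.begin(), v1.end()));
    return {min1, min2};
}
```

The second candidate matters when the two least reliable values fall in the same lane, in which case the runner-up is found in `v1` rather than in `v0`.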

E. Repetition Nodes 

The reliability of the all-zero output, $\mathrm{PM}_0$, is calculated
by accumulating $\min(\alpha_v[i], 0.0)$ using SIMD instructions.
Similarly, to calculate $\mathrm{PM}_1$, $\max(\alpha_v[i], 0.0)$ are
accumulated.
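A scalar sketch of the two accumulations is given below. The function name `repetition_sums` is illustrative, and the code only forms the two sums named in the text; how each sum is folded into the running path metric (including any sign convention) is not specified here and is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Returns {sum of min(alpha[i], 0), sum of max(alpha[i], 0)}: the quantities
// accumulated for the all-zero and all-one candidates of a repetition node.
std::pair<float, float> repetition_sums(const std::vector<float>& alpha) {
    float sum_min = 0.0f;
    float sum_max = 0.0f;
    for (float a : alpha) {
        sum_min += std::min(a, 0.0f);  // only negative LLRs contribute
        sum_max += std::max(a, 0.0f);  // only positive LLRs contribute
    }
    return {sum_min, sum_max};
}
```

In a SIMD implementation the loop body maps to one `min_ps` and one `max_ps` per vector, followed by a horizontal sum as in Listing 5.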


F. SPC Nodes 


For SPC decoders of length 4, all possible bit-flip combinations
are tested; therefore, no sorting is performed on the bit
reliability values. For longer codes, a sorting network is used
to find the four least-reliable bits. When L = 2, only the two
least reliable bits need to be located. In that case, a partial
sorting network is used as described in Section IV-D.

Since the SPC code of length 2 is equivalent to the repetition
code of the same length, we only implement the latter.


V. Adaptive Decoder 

The concatenation with a CRC provides a method to perform
early termination analogous to a syndrome check in
belief propagation decoders. In [27], this was used to gradually
increase the list size. In this work, we first decode using a
Fast-SSC polar decoder, and if the CRC is not satisfied, switch to
the list decoder with the target $L_{\max}$ value. The latency of
this adaptive approach is

$$\mathcal{L}(A_{\max}) = \mathcal{L}(L) + \mathcal{L}(F), \tag{16}$$

where $\mathcal{L}(L)$ and $\mathcal{L}(F)$ are the latencies of the
list and Fast-SSC decoders, respectively.
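The control flow of the adaptive scheme can be sketched as follows. The three callables are hypothetical stand-ins for a Fast-SSC decoder, a list-CRC decoder, and a CRC check; none of these interfaces are taken from the paper.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Llrs = std::vector<float>;  // channel log-likelihood ratios
using Bits = std::vector<int>;    // estimated bits

// All three members are placeholders to be bound to real decoders.
struct AdaptiveDecoder {
    std::function<Bits(const Llrs&)> fast_ssc;  // low-latency first attempt
    std::function<Bits(const Llrs&)> list_crc;  // list decoder with list size L_max
    std::function<bool(const Bits&)> crc_ok;    // CRC check on a candidate

    Bits decode(const Llrs& llrs) const {
        assert(fast_ssc && list_crc && crc_ok);
        Bits candidate = fast_ssc(llrs);
        if (crc_ok(candidate)) {
            return candidate;   // CRC satisfied: only the Fast-SSC latency is paid
        }
        return list_crc(llrs);  // fallback: both latencies are paid, as in (16)
    }
};
```

The list decoder is invoked only when the first attempt fails the CRC, so its cost is incurred on a fraction of frames equal to the frame-error rate of the Fast-SSC decoder.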

The improvement in throughput stems from the Fast-SSC
decoder having lower latency than the list decoder. Once the
frame-error rate ($\mathrm{FER}_F$) at the output of the Fast-SSC
decoder decreases below a certain point, the overhead of using that
decoder is compensated for by not using the list decoder. The
resulting information throughput in bit/s is

$$T = \frac{k}{(1 - \mathrm{FER}_F)\,\mathcal{L}(F) + \mathrm{FER}_F\,\mathcal{L}(L)}. \tag{17}$$

Determining whether to use the adaptive decoder depends on
the expected channel conditions and the latency of the list
decoder as dictated by $L_{\max}$. This is demonstrated in the
comparison with the LDPC codes in Section VII.
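Equation (17) is straightforward to evaluate numerically. The sketch below uses an illustrative function name, `adaptive_throughput`, and takes latencies in seconds; the Fast-SSC latency used in the usage note is a placeholder, not a measured value.

```cpp
#include <cassert>
#include <cmath>

// Information throughput in bit/s following (17).
// k: information bits per frame; fer_f: frame-error rate of the Fast-SSC decoder;
// lat_f, lat_l: latencies (in seconds) of the Fast-SSC and list decoders.
double adaptive_throughput(double k, double fer_f, double lat_f, double lat_l) {
    assert(fer_f >= 0.0 && fer_f <= 1.0);
    assert(lat_f > 0.0 && lat_l > 0.0);
    return k / ((1.0 - fer_f) * lat_f + fer_f * lat_l);
}
```

For example, with k = 1723 and a list-decoder latency of 433 µs, setting `fer_f` to 1 yields about 3.98 Mbit/s regardless of the Fast-SSC latency, while `fer_f` = 0 gives k divided by the Fast-SSC latency alone.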


VI. Performance 

A. Methodology 

All simulations were run on a single core of an Intel i7-2600
CPU with a base clock frequency of 3.4 GHz and a maximum
turbo frequency of 3.8 GHz. Software-defined radio (SDR)
applications typically use only one core for decoding, as the other
cores are reserved for other signal processing functions.
The decoder was inserted into a digital communication link
with binary phase-shift keying (BPSK) and an additive white
Gaussian noise (AWGN) channel with random codewords.

Throughput and latency numbers include the time required
to copy data to and from the decoder and are measured using
the high precision clock from the Boost Chrono library. We
report the decoder speed with turbo frequency boost enabled,
similar to [32].

We use the term polar-CRC to denote the result of concatenating
a polar code with a CRC. This concatenated code is
decoded using a list-CRC decoder. The dimension of the polar
code is increased to accommodate the CRC while maintaining
the overall code rate; e.g., a (1024, 512) polar-CRC code with
an 8-bit CRC uses a (1024, 520) polar code.












[Plot legend: CRC = 8, CRC = 32.]

Fig. 3: The effect of CRC length on the error-correction
performance of (1024, 860) list-CRC decoders with L = 128.

B. Choosing a Suitable CRC Length 

Using a CRC as the final output selection criterion
significantly improves the error-correction performance of the
decoder. The length of the chosen CRC also affects the
error-correction performance depending on the channel conditions.
Fig. 3 demonstrates this phenomenon for a (1024, 860)
polar-CRC code using 8- and 32-bit CRCs and L = 128. Such
a large list size was chosen to ensure that any observed
differences are solely due to the change in the CRC length and
could not be counteracted by increasing the list size further.
The figure shows that the performance is better at lower
$E_b/N_0$ values when the shorter CRC is used. The trend is reversed
for better channel conditions where the 32-bit CRC provides
an improvement > 0.5 dB compared to the 8-bit one.

Therefore, the length of the CRC can be selected to improve
performance for the target channel conditions.

C. Error-Correction Performance 

The error-correction performance of the proposed decoder
matches that of the SC-List decoder when no SPC constituent
decoders of lengths greater than four are used. The longer SPC
constituent decoders, denoted SPC-8+, only consider the four
least-reliable bits in their inputs. This approximation only
affects the performance when L > 2. Fig. 4 illustrates this effect
by comparing the FER of different list sizes without and with
SPC-8+ constituent decoders, labeled Dec-SPC-4 and
Dec-SPC-4+, respectively. Since for L = 2, the SPC constituent
decoders do not affect the error-correction performance, only
one graph is shown for that size. As L increases, the FER
degradation due to SPC-8+ decoders increases. The gap is
< 0.1 dB for L = 8, but grows to ≈ 0.25 dB when L is
increased to 32. These results were obtained with a CRC of
length 32 bits. The figure also shows the FER of the (2048,
1723) LDPC code [25] after 10 iterations of offset min-sum
decoding for comparison.



[Plot legend: L = 2; L = 8 and L = 32, each with Dec-SPC-4 and
Dec-SPC-4+; LDPC (2048, 1723).]

Fig. 4: FER of the polar-CRC (2048, 1723) code using the
proposed decoder with different list sizes, with and without
SPC decoders.

While using SPC-8+ constituent decoders degrades the
error-correction performance for larger L values, it decreases
decoding latency, as will be shown in the following section.
Therefore, the decision regarding whether to employ them or
not depends on the target FER and list size.

D. Latency and Throughput 

To determine the latency improvement due to the new
algorithm and implementation, we compare in Table I two
unrolled decoders with an LLR-based SC-list decoder
implemented according to the method described in [6]. The
first unrolled decoder does not implement any specialized
constituent decoders and is labeled "unrolled SC-list". The
other, labeled "unrolled Dec-SPC-4", implements all the
constituent decoders described in this work, limiting the length
of the SPC ones to four. We observe that unrolling the
SC-list decoder decreases decoding latency by more than 50%.
Furthermore, using the rate-0, rate-1, repetition, and SPC-4
constituent decoders decreases the latency to between 15.8%
(L = 2) and 18.9% (L = 32) that of the unrolled SC-list
decoder. The speed improvement gained by using the proposed
decoding algorithm and implementation compared to SC-list
decoding varies between 18.4 and 11.9 times at list sizes of 2
and 32, respectively.¹ The impact of unrolling the decoder is
more evident for smaller list sizes, whereas the new constituent
decoders play a more significant role for larger lists.
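The quoted speedups follow directly from the latencies reported in Table I and in the footnote, taking the speedup as the ratio of the SC-list latency to the latency of the proposed decoder (all values in µs):

```latex
\[
  \frac{558}{30.4} \approx 18.4 \quad (L = 2), \qquad
  \frac{5145}{433} \approx 11.9 \quad (L = 32), \qquad
  \frac{20\,500}{433} \approx 47 \quad (\text{LL-based SC-list},\ L = 32).
\]
```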

Table I also shows the latency for the proposed decoder
when no restriction on the length of the constituent SPC
decoders is present, denoted "Unrolled Dec-SPC-4+". We
note that enabling these longer constituent decoders decreases
latency by 14% and 18% for L = 2 and 8, respectively. Due
to the significant loss in error-correction performance, we do
not recommend using the SPC-8+ constituent decoders for
L > 8 and therefore do not list the latency of such a decoder
configuration.

¹The gains over an LL-based SC-list decoder are even more significant:
such a decoder has a latency of 20.5 ms for L = 32, leading the proposed
decoder to have 47 times the speed.

TABLE I: Latency (in µs) of decoding the (2048, 1723)
polar-CRC code using the proposed method with different list sizes,
with and without SPC decoders, compared to that of the SC-list
decoder. Speedups compared to SC-List are shown in brackets.

Decoder                L = 2           L = 8           L = 32
SC-List                558             1450            5145
Unrolled SC-list       193 (2.9×)      564 (2.6×)      2294 (2.2×)
Unrolled Dec-SPC-4     30.4 (18.4×)    97.5 (14.9×)    433 (11.9×)
Unrolled Dec-SPC-4+    26.3 (21.2×)    80.2 (18.1×)    N/A

TABLE II: Information throughput of the proposed adaptive
decoder with $L_{\max} = 32$.

L      info. T/P (Mbps)
       3.5 dB    4.0 dB    4.5 dB
8      32.8      92.1      196
32     8.6       33.0      196

The throughput of the proposed decoder decreases almost
linearly with L. For L = 32 with a latency of 433 µs, the
information throughput is 4.0 Mbps. As mentioned in
Section V, throughput can be improved using adaptive decoding
where a Fast-SSC decoder is used before the list decoder. The
throughput results for this approach are shown for L = 8 and
L = 32 in Table II. As $E_b/N_0$ increases, the Fast-SSC decoder
succeeds more often and the impact of the list decoder on
throughput decreases, according to (17), until it becomes negligible,
as can be observed at 4.5 dB where the throughput for both
L = 8 and 32 is the same.


VII. Comparison with LDPC Codes 


A. Comparison with the (2048, 1723) LDPC Code 

We implemented a scaled min-sum decoder for the (2048,
1723) LDPC code of [25]. To the best of our knowledge, this
is the fastest software implementation of a decoder for this code.
We used early termination and a maximum iteration count of 10.
To match the error-correction performance at the same code
length, an adaptive polar list-CRC decoder with a list size of
32 and a 32-bit CRC was used, as shown in Fig. 4.

Table III presents the results of the speed comparison
between the two decoders. It can be observed that the
proposed polar decoder has lower latency and higher throughput
throughout the entire $E_b/N_0$ range of interest. The throughput
advantage widens from seven to 78 times as the channel
conditions improve from 3.5 dB to 4.5 dB. The LDPC decoder
has three times the latency of the polar list decoder.


TABLE III: Information throughput and latency of the
proposed adaptive decoder with $L_{\max} = 32$ compared to the
(2048, 1723) LDPC decoder.

Decoder      Latency (ms)    info. T/P (Mbps)
                             3.5 dB    4.0 dB    4.5 dB
LDPC         1.6             u         2.0       2.5
This work    0.44            8.6       33.0      196


B. Comparison with the 802.11n LDPC Codes

The fastest software LDPC decoders in the literature are those
of [32], which implement decoders for the 802.11n standard
using the same Intel Core i7-2600 as this work.

The standard defines three code lengths: 1944, 1296, 648;
and four code rates: 1/2, 2/3, 3/4, 5/6. The work in [32]
implements decoders for codes of length 1944 and all four
rates using a layered offset-min-sum decoding algorithm with
five iterations.

Fig. 5 shows the FER of these codes using a 10-iteration,
flooding-schedule offset min-sum decoder that yields slightly
better results than the five-iteration layered decoder [32]. The
figure also shows the FER of polar-CRC codes (with 8-bit
CRC) of the same rate, but shorter: N = 1024 instead of 1944.
As can be seen in the figure, when these codes were decoded
using a list-CRC decoder with L = 2, their FER remained
within 0.1 dB of the LDPC codes. Specifically, for all codes
but the one with rate 2/3, the polar-CRC codes have better FER
than their LDPC counterparts down to at least FER = 2 x 10 '.
For a wireless communication system with retransmission such
as 802.11, this constitutes the FER range of interest. These
results show that the FER of the N = 1024 codes is sufficient
and that it is unnecessary to use longer codes to improve it
further.

The latency and throughput of the LDPC decoders are
calculated in [32] for the case where 524,280 information bits
are transferred using multiple LDPC codewords. Table IV
compares the speed of LDPC and polar-CRC decoders when decoding
that many bits on an Intel Core i7-2600 with turbo frequency
boost enabled. The latency comprises the total time required
to decode all bits in addition to copying them from and to
the decoder memory. The results show that the proposed
list-CRC decoders are faster than the LDPC ones. The decoder in
[32] meets the minimum throughput requirements set in [26]
for codes of rate 1/2 and for two out of three cases when
the rate is 3/4 (MCS indexes 2 and 3). Our proposed decoder
meets the minimum throughput requirements at all code rates.
This shows that in this case, a software polar list decoder
obtains higher speeds and similar FER to the LDPC decoder,
but with a code about half as long. Since the proposed decoder
operates on individual frames (intra-frame parallelism using
SIMD), the latency per frame is significantly lower and is less
than 15 µs for the tested codes, as shown in the table. It should
be noted that neither decoder employs early termination: the
LDPC decoder in [32] always uses five iterations, and the
list-CRC decoder does not utilize adaptive decoding. The number
of LDPC and polar code frames required to transmit the 524,280
information bits at each code rate is also shown in Table IV.
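The frame counts and throughput figures reported in Table IV can be reproduced with a short calculation. The sketch below assumes that each frame carries k = N·R information bits (for example, 972 for the length-1944, rate-1/2 LDPC code and 512 for the length-1024, rate-1/2 polar-CRC code); the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Number of codewords needed to carry info_bits when each frame holds
// k information bits (ceiling division).
long frames_needed(long info_bits, long k) {
    assert(info_bits >= 0 && k > 0);
    return (info_bits + k - 1) / k;
}

// Information throughput in Mbps given the total decoding latency in ms.
double info_throughput_mbps(long info_bits, double total_latency_ms) {
    assert(total_latency_ms > 0.0);
    const double seconds = total_latency_ms * 1e-3;
    return static_cast<double>(info_bits) / seconds / 1e6;
}
```

For 524,280 information bits this gives 540 LDPC frames and 1024 polar-CRC frames at rate 1/2, and total latencies of 17.4 ms and 13.8 ms correspond to roughly 30.1 Mbps and 38.0 Mbps, respectively.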


VIII. Conclusion 

In this work, we described an algorithm to significantly 
reduce the latency of polar list decoding, by an order of magni¬ 
tude compared to the prior art when implemented in software. 
[Plot: FER versus $E_b/N_0$ (dB); curves for LDPC and Polar-CRC
codes with R = 1/2, 2/3, 3/4, and 5/6.]

Fig. 5: Frame-error rate of the proposed decoders of length
1024 compared with those of the 802.11n standard of length
1944.


TABLE IV: Information throughput and latency of the
proposed list decoder compared with the LDPC decoders of [32]
when estimating 524,280 information bits.

Decoder     N       # of N-bit   Rate   Latency (ms)          info. T/P
                    frames              total    per frame    (Mbps)
[32]        1944    540          1/2    17.4     N/A          30.1
proposed    1024    1024         1/2    13.8     0.014        38.0
[32]        1944    405          2/3    12.7     N/A          41.0
proposed    1024    768          2/3    10.0     0.013        52.4
[32]        1944    360          3/4    11.2     N/A          46.6
proposed    1024    683          3/4    8.78     0.013        59.6
[32]        1944    324          5/6    9.3      N/A          56.4
proposed    1024    615          5/6    6.2      0.010        84.5


We also showed that polar list decoders may be suitable for
software-defined radio applications as they can achieve high
throughput, especially when using adaptive decoding.
Furthermore, when compared with state-of-the-art LDPC software
decoders from wireless standards, we demonstrated that polar
codes could achieve at least the same throughput and similar
FER, while using significantly shorter codes. Future work will
focus on implementing unrolled list decoders as
application-specific integrated circuits (ASICs), which we expect
to have throughput approaching 1 Gbps.